Extract Text¶
Text extraction from documents and in-memory binaries with optional OCR.
Designed for heterogeneous inputs (PDF, Office, images, CSV, HTML/XML) and selective processing of pages and sheets. When ocr=True, raster pages are recognized with Tesseract, while born-digital text is read directly. The result is returned as a single Unicode string.
Note
For OCR capabilities, ensure you have Tesseract installed on your system.
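One way to verify this prerequisite before enabling OCR is to look for the Tesseract executable on the PATH. The sketch below uses only the standard library; the helper name `tesseract_available` is hypothetical (not part of TextWizard), and it assumes the binary is installed under its usual name, `tesseract`.

```python
import shutil


def tesseract_available(binary: str = "tesseract") -> bool:
    """Return True if the given executable can be found on PATH."""
    return shutil.which(binary) is not None


if __name__ == "__main__":
    if tesseract_available():
        print("Tesseract found on PATH")
    else:
        print("Tesseract not found; OCR calls are likely to fail")
```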
Supported formats¶
| Format | OCR option |
|---|---|
| PDF | Optional |
| DOC | No |
| DOCX | Optional |
| XLSX | No |
| XLS | No |
| TXT | No |
| CSV | No |
| JSON | No |
| HTML | No |
| HTM | No |
| TIF | Default |
| TIFF | Default |
| JPG/JPEG | Default |
| PNG | Default |
| GIF | Default |
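The table above can be mirrored as a small lookup, which may be handy for deciding up front whether passing ocr=True makes sense for a given file. This is only an illustrative sketch: `OCR_MODE` and `ocr_mode` are hypothetical names, not part of TextWizard, and the mapping is transcribed by hand from the table.

```python
from pathlib import Path
from typing import Optional

# OCR behaviour per extension, transcribed from the table above.
# "pdf" is taken as the first Optional entry, consistent with the
# mention of scanned PDFs in the ocr parameter description.
OCR_MODE = {
    "pdf": "optional", "docx": "optional",
    "doc": "no", "xlsx": "no", "xls": "no", "txt": "no",
    "csv": "no", "json": "no", "html": "no", "htm": "no",
    "tif": "default", "tiff": "default", "jpg": "default",
    "jpeg": "default", "png": "default", "gif": "default",
}


def ocr_mode(filename: str) -> Optional[str]:
    """Return "optional", "no" or "default" for a file name, or None if unlisted."""
    ext = Path(filename).suffix.lower().lstrip(".")
    return OCR_MODE.get(ext)
```

For example, `ocr_mode("scan.TIFF")` gives `"default"`, while an unlisted extension such as `.zip` gives `None`.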
Parameters¶
| Parameter | Description |
|---|---|
| input_data | (str \| bytes \| Path) Source to extract from: path string, bytes, or pathlib.Path. |
| extension | (str, optional) File extension when input_data is bytes. |
| pages | (int \| str \| list[int\|str] \| None) Page/sheet selection. For paged formats use numbers and ranges (e.g. "1-3"); for spreadsheets use sheet indices or names. |
| ocr | (bool, optional) Enable Tesseract OCR for images and scanned PDFs/DOCX. Defaults to False. |
| language_ocr | (str, optional) Tesseract language code. Defaults to 'eng'. |
Detailed parameters and examples¶
input_data¶
Accepts a filesystem path, a pathlib.Path, or raw bytes.
Path string
import textwizard as tw
text = tw.extract_text("docs/report.pdf")
pathlib.Path
from pathlib import Path
import textwizard as tw
text = tw.extract_text(Path("docs/report.pdf"))
Bytes (must set extension)
from pathlib import Path
import textwizard as tw
raw = Path("img.png").read_bytes()
text = tw.extract_text(raw, extension="png")
BytesIO (streams)
import io
import textwizard as tw
with open("img.png", "rb") as fh:  # close the file once it has been read
    buf = io.BytesIO(fh.read())
text = tw.extract_text(buf.getvalue(), extension="png")  # pass the buffer's contents as bytes
extension¶
Required only when passing bytes. Indicates the file type.
Example
from pathlib import Path
import textwizard as tw
png_bytes = Path("img.png").read_bytes()  # raw bytes carry no file name, so the type must be stated
text = tw.extract_text(png_bytes, extension="png")
Warning
Passing bytes without extension raises a validation error.
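When bytes arrive without a file name (for example from an upload or a database blob), the extension has to come from somewhere. One possible approach, sketched here under the assumption that only a handful of binary formats need to be told apart, is to inspect the leading bytes. `guess_extension` is a hypothetical helper and not part of TextWizard; the signatures are the well-known file headers for PDF, PNG, JPEG, GIF and TIFF.

```python
from typing import Optional

# Leading byte signatures for some of the supported binary formats.
_SIGNATURES = (
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"II*\x00", "tiff"),  # little-endian TIFF
    (b"MM\x00*", "tiff"),  # big-endian TIFF
)


def guess_extension(data: bytes) -> Optional[str]:
    """Return an extension for recognised headers, or None if unknown."""
    for signature, ext in _SIGNATURES:
        if data.startswith(signature):
            return ext
    return None
```

A caller could then pass `extension=guess_extension(raw)` and handle the `None` case itself, rather than letting the validation error surface.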
pages¶
Select pages (PDF/DOCX/TIFF) or sheets (XLSX/XLS).
Accepted forms by format:
- Paged (PDF, DOCX, TIFF), 1-based:
  - single int: 1
  - range string: "1-3"
  - CSV string: "1,3,5-7"
  - mixed list: [1, 3, "5-7"]

  Invalid tokens and out-of-range pages are silently skipped.
- Spreadsheets (XLSX/XLS):
  - sheet index, 0-based (int), e.g. 0
  - sheet name (str), e.g. "Summary"
  - list of the above, e.g. [0, "Q4", 5, 6]

  Range strings like "5-7" are not supported for sheets; use explicit indices ([5, 6, 7]).
- Images:
  - JPG/PNG/GIF: page selection is ignored (single frame).
  - TIFF multipage: pass a list of 1-based integers (e.g. [1, 3, 5]).
- TXT/CSV/HTML/JSON: pages is ignored.
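To make the paged selection rules concrete, the sketch below expands a selection into explicit 1-based page numbers, skipping invalid tokens and out-of-range pages as described. It is an illustration of the documented semantics, not TextWizard's actual implementation: `expand_pages` is a hypothetical helper, and treating None as "all pages" and returning a sorted, de-duplicated list are assumptions.

```python
from typing import Iterable, List, Union

PageSpec = Union[int, str, Iterable[Union[int, str]], None]


def expand_pages(spec: PageSpec, page_count: int) -> List[int]:
    """Expand a page selection into sorted, unique 1-based page numbers."""
    if spec is None:
        # Assumption: no selection means every page.
        return list(range(1, page_count + 1))
    items = [spec] if isinstance(spec, (int, str)) else list(spec)
    selected = set()
    for item in items:
        # A string may hold several comma-separated tokens; an int is one token.
        for token in str(item).split(","):
            start, sep, end = token.strip().partition("-")
            try:
                first = int(start)
                last = int(end) if sep else first
            except ValueError:
                continue  # invalid token: skipped silently
            for page in range(first, last + 1):
                if 1 <= page <= page_count:  # out-of-range pages are skipped
                    selected.add(page)
    return sorted(selected)
```

With a 10-page document, `expand_pages([1, 3, "5-7"], 10)` yields `[1, 3, 5, 6, 7]`; with only 6 pages, `expand_pages("1,3,5-7", 6)` yields `[1, 3, 5, 6]`.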
Examples — paged
import textwizard as tw
page1 = tw.extract_text("docs/big.pdf", pages=1)
subset = tw.extract_text("docs/big.pdf", pages=[1, 3, "5-7"])
Examples — spreadsheets
import textwizard as tw
first = tw.extract_text("tables.xlsx", pages=0) # first sheet (0-based)
named = tw.extract_text("tables.xlsx", pages="Summary") # sheet by name
multi = tw.extract_text("tables.xlsx", pages=[0, "Q4", 5, 6]) # explicit indices; no "5-7"
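Because range strings are not supported for sheets, a contiguous run of sheets has to be spelled out as explicit indices. A small helper along these lines could build that list; `sheet_selection` is a hypothetical name used for illustration and is not part of TextWizard.

```python
from typing import List, Sequence, Union

SheetRef = Union[int, str]


def sheet_selection(first: int, last: int, extra: Sequence[SheetRef] = ()) -> List[SheetRef]:
    """Build an explicit sheet list: 0-based indices first..last (inclusive), then extras."""
    if first < 0 or last < first:
        raise ValueError("expected 0 <= first <= last")
    return list(range(first, last + 1)) + list(extra)
```

The result can be passed directly as the pages argument, e.g. `tw.extract_text("tables.xlsx", pages=sheet_selection(5, 7))`, which is equivalent to `pages=[5, 6, 7]`.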
ocr and language_ocr¶
Enable OCR for raster content and scanned documents. language_ocr controls the recognition language.
Images
import textwizard as tw
img_txt = tw.extract_text("scan.tiff", ocr=True) # default 'eng'
Scanned PDF
import textwizard as tw
pdf_txt = tw.extract_text("contract_scanned.pdf", ocr=True, language_ocr="ita")
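A language code such as "ita" only helps if the matching Tesseract language data is installed. The sketch below lists the installed languages by calling the Tesseract CLI with its `--list-langs` option. Both helper names are hypothetical and not part of TextWizard, and the parser assumes the usual output shape: a "List of available languages" header line followed by one code per line.

```python
import shutil
import subprocess
from typing import List


def parse_langs(output: str) -> List[str]:
    """Extract language codes from `tesseract --list-langs` output."""
    lines = [line.strip() for line in output.splitlines()]
    # Drop blank lines and the "List of available languages ..." header.
    return [line for line in lines if line and not line.lower().startswith("list of")]


def installed_langs() -> List[str]:
    """Return installed Tesseract languages, or [] if Tesseract is not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return []
    result = subprocess.run(
        [exe, "--list-langs"], capture_output=True, text=True, check=False
    )
    # Some Tesseract releases print the list to stderr, so both streams are read.
    return parse_langs(result.stdout + "\n" + result.stderr)
```

A check such as `"ita" in installed_langs()` before the call can turn a confusing OCR failure into a clear message. Any warnings Tesseract prints would also end up in the list, so the result is best treated as a hint.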
Returns¶
str — concatenated Unicode text from the selected pages/sheets.
Errors¶
- Bytes without extension → validation error.
- Unsupported or invalid input → domain-specific error.
- Missing or unreadable file → I/O error.
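Since the concrete exception classes are not named here, a caller that processes many files may prefer a thin wrapper that never raises. The following is a minimal sketch: `safe_extract` is a hypothetical helper, the extraction function is passed in as a parameter, and it is an assumption that I/O errors surface as OSError subclasses.

```python
from typing import Any, Callable, Optional, Tuple


def safe_extract(
    extract: Callable[..., str], source: Any, **kwargs: Any
) -> Tuple[Optional[str], Optional[str]]:
    """Call `extract` and return (text, error_message); exactly one is None."""
    try:
        return extract(source, **kwargs), None
    except OSError as exc:
        # Assumption: missing or unreadable files surface as OSError subclasses.
        return None, f"I/O error: {exc}"
    except Exception as exc:
        # Validation and domain-specific errors; their classes are not listed here.
        return None, f"{type(exc).__name__}: {exc}"
```

Usage would look like `text, err = safe_extract(tw.extract_text, "docs/report.pdf", pages=1)`, with `err` logged and the file skipped when it is not None.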
See also¶
Extract Text (Azure) — Cloud OCR for text, tables, and key-value
TextWizard — Overview and quick start