Extract Text¶

Extracts text from documents and in-memory binaries, with optional OCR. Designed for heterogeneous inputs (PDF, Office, images, CSV, HTML/XML) and for selective processing of pages and sheets. When ocr=True, raster pages are recognized with Tesseract, while born-digital text is read directly. Returns a single Unicode string.

Note

For OCR capabilities, ensure you have Tesseract installed on your system.
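
One way to verify this prerequisite before requesting OCR is to look for the Tesseract executable on the PATH. The following is a minimal sketch using only the standard library. The helper name `tesseract_available` is hypothetical and not part of textwizard, and it assumes the binary is named `tesseract`; textwizard itself may locate Tesseract differently.

```python
import shutil


def tesseract_available() -> bool:
    """Return True if an executable named `tesseract` is found on PATH.

    Assumption: the binary uses its conventional name and is reachable via PATH.
    """
    return shutil.which("tesseract") is not None


if __name__ == "__main__":
    if tesseract_available():
        print("Tesseract found on PATH")
    else:
        print("Tesseract not found on PATH; OCR calls will likely fail")
```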

Supported formats¶

Format	OCR option
PDF	Optional
DOC	No
DOCX	Optional
XLSX	No
XLS	No
TXT	No
CSV	No
JSON	No
HTML	No
HTM	No
TIF	Default
TIFF	Default
JPG/JPEG	Default
PNG	Default
GIF	Default
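
The table above can be turned into a small lookup, which may be handy when deciding whether passing `ocr=True` makes sense for a given file. This is only an illustrative sketch: `OCR_OPTION` and `ocr_option` are hypothetical names, not part of textwizard, and the values are copied from the table rather than queried from the library.

```python
from pathlib import Path
from typing import Optional

# OCR option per extension, copied from the "Supported formats" table.
OCR_OPTION = {
    "pdf": "Optional", "docx": "Optional",
    "doc": "No", "xlsx": "No", "xls": "No", "txt": "No", "csv": "No",
    "json": "No", "html": "No", "htm": "No",
    "tif": "Default", "tiff": "Default", "jpg": "Default",
    "jpeg": "Default", "png": "Default", "gif": "Default",
}


def ocr_option(filename: str) -> Optional[str]:
    """Return the table's OCR option for a filename, or None if the format is not listed."""
    ext = Path(filename).suffix.lower().lstrip(".")
    return OCR_OPTION.get(ext)
```

For example, `ocr_option("report.PDF")` yields `"Optional"`, while a file type missing from the table yields `None`.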

Parameters¶

Parameter	Description
`input_data`	(str \| bytes \| Path) Source to extract from: path string, bytes, or `pathlib.Path`.
`extension`	(str, optional) File extension when `input_data` is bytes (e.g., `"pdf"`, `"png"`, `"xlsx"`).
`pages`	(int \| str \| list[int\|str] \| None) Page/sheet selection. For paged formats use numbers and ranges (`1`, `"2-5"`, `[1, "5-7"]`). For spreadsheets pass sheet index, name, or a mixed list.
`ocr`	(bool, optional) Enable Tesseract OCR for images and scanned PDF/DOCX files. Defaults to `False`.
`language_ocr`	(str, optional) Tesseract language code. Defaults to `"eng"`.

Detailed parameters and examples¶

`input_data`¶

Accepts a filesystem path, a pathlib.Path, or raw bytes.

Path string

import textwizard as tw
text = tw.extract_text("docs/report.pdf")

pathlib.Path

from pathlib import Path
import textwizard as tw
text = tw.extract_text(Path("docs/report.pdf"))

Bytes (must set `extension`)

from pathlib import Path
import textwizard as tw
raw = Path("img.png").read_bytes()
text = tw.extract_text(raw, extension="png")

BytesIO (streams)

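import io
import textwizard as tw
with open("img.png", "rb") as f:  # close the file handle once the data is read
    buf = io.BytesIO(f.read())
text = tw.extract_text(buf.getvalue(), extension="png")  # pass the buffer's bytes, not the stream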

`extension`¶

Required only when passing bytes. Indicates the file type.

Example

import textwizard as tw
with open("img.png", "rb") as f:  # close the file handle once the data is read
    png_bytes = f.read()
text = tw.extract_text(png_bytes, extension="png")

Warning

Passing bytes without extension raises a validation error.
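
When the bytes come from a source whose original filename is known (an upload, for instance), the extension can be derived from that name instead of being hard-coded. The sketch below assumes exactly that situation. `extension_from_name` and `extract_from_bytes` are hypothetical helpers, not part of textwizard; only `tw.extract_text(..., extension=...)` comes from this page.

```python
from pathlib import Path


def extension_from_name(filename: str) -> str:
    """Return the lowercase extension without the leading dot, e.g. 'png'."""
    ext = Path(filename).suffix.lower().lstrip(".")
    if not ext:
        # Fail early with a clear message instead of passing an empty extension on.
        raise ValueError(f"cannot infer an extension from {filename!r}")
    return ext


def extract_from_bytes(data: bytes, filename: str) -> str:
    """Pass in-memory bytes to extract_text, deriving `extension` from the filename."""
    import textwizard as tw  # imported lazily so the helper above works without it

    return tw.extract_text(data, extension=extension_from_name(filename))
```

Only the last suffix is used, so a name like `archive.tar.gz` maps to `gz`.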

`pages`¶

Select pages (PDF/DOCX/TIFF) or sheets (XLSX/XLS).

Accepted forms by format:

Paged (PDF, DOCX, TIFF), 1-based:
- single int: 1
- range string: "1-3"
- CSV string: "1,3,5-7"
- mixed list: [1, 3, "5-7"]
Invalid tokens and out-of-range pages are silently skipped.

Spreadsheets (XLSX/XLS):
- sheet index, 0-based (int), e.g. 0
- sheet name (str), e.g. "Summary"
- list of the above, e.g. [0, "Q4", 5, 6]
Range strings like "5-7" are not supported for sheets; use explicit indices ([5, 6, 7]).

Images:
- JPG/PNG/GIF: page selection is ignored (single frame).
- TIFF multipage: pass a list of 1-based integers (e.g. [1, 3, 5]).

TXT/CSV/HTML/JSON: pages is ignored.

Examples — paged

import textwizard as tw
page1  = tw.extract_text("docs/big.pdf", pages=1)
subset = tw.extract_text("docs/big.pdf", pages=[1, 3, "5-7"])
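
To make the selection rules for paged formats concrete, the following sketch expands a `pages` value into explicit 1-based page numbers. It illustrates the documented behavior and is not textwizard's actual parser. `expand_pages` is a hypothetical name, and three details are assumptions: `None` is treated as "all pages", results are returned sorted and de-duplicated, and a reversed range such as "7-5" selects nothing.

```python
def expand_pages(spec, page_count):
    """Expand a page selection into a sorted list of unique 1-based page numbers."""
    if spec is None:
        return list(range(1, page_count + 1))  # assumption: None selects every page
    items = spec if isinstance(spec, (list, tuple)) else [spec]
    selected = set()
    for item in items:
        # A string may hold several comma-separated tokens, e.g. "1,3,5-7".
        tokens = item.split(",") if isinstance(item, str) else [item]
        for token in tokens:
            if isinstance(token, bool):
                continue  # booleans are ints in Python; treat them as invalid
            if isinstance(token, int):
                start = end = token
            else:
                parts = [p.strip() for p in str(token).split("-")]
                if not (1 <= len(parts) <= 2 and all(p.isdigit() for p in parts)):
                    continue  # invalid token: skipped silently
                start, end = int(parts[0]), int(parts[-1])
            for page in range(start, end + 1):
                if 1 <= page <= page_count:  # out-of-range pages are dropped
                    selected.add(page)
    return sorted(selected)
```

With a six-page document, `expand_pages([1, 3, "5-7"], 6)` keeps pages 1, 3, 5, and 6 and drops the out-of-range page 7.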

Examples — spreadsheets

import textwizard as tw
first  = tw.extract_text("tables.xlsx", pages=0)               # first sheet (0-based)
named  = tw.extract_text("tables.xlsx", pages="Summary")       # sheet by name
multi  = tw.extract_text("tables.xlsx", pages=[0, "Q4", 5, 6]) # explicit indices; no "5-7"
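
Because range strings are not accepted for sheets, a contiguous run of sheet indices has to be spelled out. A small helper can build that explicit list from Python `range` objects. This is a sketch only; `sheet_selection` is a hypothetical name and not part of textwizard.

```python
from typing import List, Union


def sheet_selection(*items: Union[int, str, range]) -> List[Union[int, str]]:
    """Flatten 0-based indices, sheet names, and range objects into an explicit list."""
    selection: List[Union[int, str]] = []
    for item in items:
        if isinstance(item, range):
            selection.extend(item)  # expand the range into individual indices
        else:
            selection.append(item)
    return selection
```

The result can be passed directly as `pages`, for example `tw.extract_text("tables.xlsx", pages=sheet_selection(0, "Q4", range(5, 8)))`, which selects sheet 0, the sheet named "Q4", and sheets 5 through 7.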

Set ocr=True to enable OCR for raster content and scanned documents. language_ocr controls the recognition language.

Images

import textwizard as tw
img_txt = tw.extract_text("scan.tiff", ocr=True)               # default 'eng'

Scanned PDF

import textwizard as tw
pdf_txt = tw.extract_text("contract_scanned.pdf", ocr=True, language_ocr="ita")

Returns¶

str — concatenated Unicode text from the selected pages/sheets.
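
Since the result is a single concatenated string, page boundaries are not preserved. If per-page text is needed, one option is to call the extractor once per page. The sketch below takes the extractor as an argument so it can be tried without any files. `extract_per_page` is a hypothetical helper, and calling the extractor repeatedly presumably reopens the document each time, which may be slow for large files.

```python
from typing import Callable, Dict, Iterable


def extract_per_page(
    extract: Callable[..., str], source, pages: Iterable[int]
) -> Dict[int, str]:
    """Call `extract` once per page so that each page's text stays separate."""
    return {page: extract(source, pages=page) for page in pages}
```

A typical call would be `extract_per_page(tw.extract_text, "docs/big.pdf", [1, 2, 3])`, which returns a dictionary keyed by page number.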

Errors¶

Bytes without extension → validation error.
Unsupported or invalid input → domain-specific error.
Missing or unreadable file → I/O error.
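
The exact exception classes are not named here, so calling code may want a defensive wrapper. The following is a minimal sketch under two assumptions: that I/O failures surface as `OSError` subclasses (such as `FileNotFoundError`), and that validation and domain-specific errors are ordinary `Exception` subclasses. `try_extract` is a hypothetical helper, not part of textwizard.

```python
from typing import Callable, Optional, Tuple


def try_extract(
    extract: Callable[..., str], source, **kwargs
) -> Tuple[Optional[str], Optional[str]]:
    """Call `extract(source, **kwargs)` and return (text, error_message)."""
    try:
        return extract(source, **kwargs), None
    except OSError as exc:  # assumption: missing or unreadable files raise OSError
        return None, f"I/O error: {exc}"
    except Exception as exc:  # validation and domain-specific errors
        return None, f"{type(exc).__name__}: {exc}"
```

Usage would look like `text, err = try_extract(tw.extract_text, "docs/report.pdf", pages=1)`, where exactly one of `text` and `err` is `None`.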
