Extract Text

Extracts text from documents and in-memory binaries, with optional OCR. It is designed for heterogeneous inputs (PDF, Office, images, CSV, HTML/XML) and for selective processing of pages and sheets. When ocr=True, raster pages are recognized with Tesseract, while born-digital text is read directly. Returns a single Unicode string.

Note

For OCR capabilities, ensure you have Tesseract installed on your system.
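
A quick pre-flight check can confirm that Tesseract is reachable before calling extract_text with ocr=True. The sketch below is not part of textwizard: tesseract_available is a hypothetical helper, and it assumes the executable is named tesseract and is on PATH, which may not match every installation.

```python
import shutil


def tesseract_available() -> bool:
    """Return True if an executable named 'tesseract' is found on PATH.

    Assumption: the binary is called 'tesseract'. textwizard may locate
    Tesseract differently; this only checks the common case.
    """
    return shutil.which("tesseract") is not None


if __name__ == "__main__":
    if tesseract_available():
        print("Tesseract found on PATH")
    else:
        print("Tesseract not found on PATH; OCR calls may fail")
```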

Supported formats

Format      OCR option
PDF         Optional
DOC         No
DOCX        Optional
XLSX        No
XLS         No
TXT         No
CSV         No
JSON        No
HTML        No
HTM         No
TIF         Default
TIFF        Default
JPG/JPEG    Default
PNG         Default
GIF         Default

Parameters

Parameter       Description
input_data      (str | bytes | Path) Source to extract from: path string, bytes, or pathlib.Path.
extension       (str, optional) File extension when input_data is bytes (e.g., "pdf", "png", "xlsx").
pages           (int | str | list[int|str] | None) Page/sheet selection. For paged formats, use numbers and ranges (1, "2-5", [1, "5-7"]). For spreadsheets, pass a sheet index, a sheet name, or a mixed list.
ocr             (bool, optional) Enable Tesseract OCR for images and for scanned PDF and DOCX files. Defaults to False.
language_ocr    (str, optional) Tesseract language code. Defaults to "eng".

Detailed parameters and examples

input_data

Accepts a filesystem path, a pathlib.Path, or raw bytes.

Path string

import textwizard as tw
text = tw.extract_text("docs/report.pdf")

pathlib.Path

from pathlib import Path
import textwizard as tw
text = tw.extract_text(Path("docs/report.pdf"))

Bytes (must set extension)

from pathlib import Path
import textwizard as tw
raw = Path("img.png").read_bytes()
text = tw.extract_text(raw, extension="png")

BytesIO (streams)

import io
import textwizard as tw
# Load the file into an in-memory stream, then pass its underlying bytes.
with open("img.png", "rb") as f:
    buf = io.BytesIO(f.read())
text = tw.extract_text(buf.getvalue(), extension="png")

extension

Required only when passing bytes. Indicates the file type.

Example

import textwizard as tw
with open("img.png", "rb") as f:
    png_bytes = f.read()
text = tw.extract_text(png_bytes, extension="png")

Warning

Passing bytes without extension raises a validation error.
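
When the bytes originate from a named file or upload, the extension can usually be derived from the name rather than hard-coded. The helper below is a hypothetical convenience, not a textwizard function; it assumes the bare, dot-less form shown in the examples ("png", "pdf") is what extract_text expects.

```python
from pathlib import Path


def extension_from_name(name: str) -> str:
    """Return the lowercased final suffix of a file name, without the dot.

    Returns an empty string when the name has no suffix, so the caller
    can decide how to handle that case before calling extract_text.
    """
    return Path(name).suffix.lstrip(".").lower()


print(extension_from_name("scans/IMG.PNG"))   # png
print(extension_from_name("README"))          # empty string: no suffix
```

The result can then be passed as extension= alongside the raw bytes. An empty result is worth checking for explicitly, since bytes without an extension are rejected.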

pages

Select pages (PDF/DOCX/TIFF) or sheets (XLSX/XLS).

Accepted forms by format:

  • Paged (PDF, DOCX, TIFF), 1-based:
      - single int: 1
      - range string: "1-3"
      - comma-separated string: "1,3,5-7"
      - mixed list: [1, 3, "5-7"]
    Invalid tokens and out-of-range pages are silently skipped.

  • Spreadsheets (XLSX/XLS):
      - sheet index, 0-based (int), e.g. 0
      - sheet name (str), e.g. "Summary"
      - list of the above, e.g. [0, "Q4", 5, 6]
    Range strings like "5-7" are not supported for sheets; use explicit indices ([5, 6, 7]).

  • Images:
      - JPG/PNG/GIF: page selection is ignored (single frame).
      - TIFF multipage: pass a list of 1-based integers (e.g. [1, 3, 5]).

  • TXT/CSV/HTML/JSON: pages is ignored.
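
To make the accepted forms for paged formats concrete, the sketch below expands a selection into explicit 1-based page numbers. It is an illustration only, not textwizard's implementation: expand_pages and its page_count argument are hypothetical, and returning a sorted, de-duplicated list and clipping ranges at page_count are assumptions of this sketch.

```python
def expand_pages(spec, page_count=None):
    """Expand a page selection into sorted, unique 1-based page numbers.

    Accepts an int, a string such as "2-5" or "1,3,5-7", or a mixed list.
    Tokens that cannot be parsed are skipped rather than raising.
    """
    items = spec if isinstance(spec, (list, tuple)) else [spec]
    pages = set()
    for item in items:
        for token in str(item).split(","):
            start, sep, end = token.strip().partition("-")
            try:
                lo = int(start)
                hi = int(end) if sep else lo
            except ValueError:
                continue  # invalid token: skip it
            if lo >= 1 and hi >= lo:
                pages.update(range(lo, hi + 1))
    if page_count is not None:
        # Assumption: pages past the end of the document are dropped.
        pages = {p for p in pages if p <= page_count}
    return sorted(pages)


print(expand_pages([1, 3, "5-7"]))        # [1, 3, 5, 6, 7]
print(expand_pages("1,3,5-7"))            # [1, 3, 5, 6, 7]
print(expand_pages(["x", "2-", 4]))       # [4]
print(expand_pages("1-10", page_count=3)) # [1, 2, 3]
```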

Examples — paged

import textwizard as tw
page1  = tw.extract_text("docs/big.pdf", pages=1)
subset = tw.extract_text("docs/big.pdf", pages=[1, 3, "5-7"])

Examples — spreadsheets

import textwizard as tw
first  = tw.extract_text("tables.xlsx", pages=0)               # first sheet (0-based)
named  = tw.extract_text("tables.xlsx", pages="Summary")       # sheet by name
multi  = tw.extract_text("tables.xlsx", pages=[0, "Q4", 5, 6]) # explicit indices; no "5-7"
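
Because range strings are not accepted for sheets, a contiguous run of sheet indices can be built with plain Python before the call. This is ordinary list unpacking rather than a textwizard feature; the variable name sheets is illustrative.

```python
# Build the explicit equivalent of the unsupported "5-7" sheet range.
# range(5, 8) yields 5, 6, 7 (the upper bound is exclusive).
sheets = [0, "Q4", *range(5, 8)]
print(sheets)  # [0, 'Q4', 5, 6, 7]
```

The resulting list can be passed as pages= in the same way as the hand-written list above.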

ocr and language_ocr

Set ocr=True to run OCR on raster content and scanned documents. language_ocr sets the recognition language.

Images

import textwizard as tw
img_txt = tw.extract_text("scan.tiff", ocr=True)               # default 'eng'

Scanned PDF

import textwizard as tw
pdf_txt = tw.extract_text("contract_scanned.pdf", ocr=True, language_ocr="ita")

Returns

str — concatenated Unicode text from the selected pages/sheets.

Errors

  • Bytes without extension → validation error.

  • Unsupported or invalid input → domain-specific error.

  • Missing or unreadable file → I/O error.
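
In batch jobs it can be useful to record a failure and move on rather than stop at the first bad file. The wrapper below is a sketch under stated assumptions: try_extract is a hypothetical helper, and it catches Exception broadly only because the concrete exception classes are not named in this section. Narrow the except clause once the actual classes are known.

```python
def try_extract(extract, source, **kwargs):
    """Call extract(source, **kwargs) and return (text, error).

    On success the error is None; on failure the text is None and the
    caught exception is returned for logging or inspection.
    """
    try:
        return extract(source, **kwargs), None
    except Exception as exc:  # broad on purpose: exact classes are not listed here
        return None, exc


if __name__ == "__main__":
    # Stand-in for an extraction function, used only to demonstrate the wrapper.
    def fake_extract(source, **kwargs):
        if isinstance(source, bytes) and "extension" not in kwargs:
            raise ValueError("extension is required for bytes input")
        return "extracted text"

    print(try_extract(fake_extract, "report.pdf"))
    print(try_extract(fake_extract, b"raw"))
```

In real use the first argument would be the extraction function itself, with the same keyword arguments described above.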

See also