Extract Text (Azure)¶

Cloud OCR and layout extraction via Azure Document Intelligence. Processes scanned and born-digital documents, returning unified text plus optional tables and key-value pairs (with the layout model). Supports selective page processing and a hybrid mode that mixes native extraction with OCR for PDFs.

Note

Requires an Azure resource (endpoint + key). See Azure Document Intelligence.

Supported formats¶

Format	Notes
PDF	OCR or hybrid
DOCX	OCR for embedded images
JPG/JPEG	OCR
PNG	OCR
TIF/TIFF	OCR
GIF	OCR

Parameters¶

Parameter	Description
`input_data`	(str \| bytes \| Path) Path string, bytes, or `pathlib.Path`.
`extension`	(str, optional) Required if `input_data` is bytes (e.g. `"pdf"`).
`language_ocr`	(str, optional) OCR language code (ISO). Default `"eng"`.
`pages`	(int \| str \| list[int\|str] \| None) Page selection: `1`, `"1,3,5-7"` or mixed list `[1, 3, "5-7"]`. 1-based.
`azure_endpoint`	(str) Azure endpoint URL.
`azure_key`	(str) Azure API key.
`azure_model_id`	(str) `"prebuilt-read"` for text only; `"prebuilt-layout"` adds tables and key-value pairs.
`hybrid`	(bool, optional) PDFs: native text for text pages, OCR for raster. Default `False`.

Return value¶

CloudExtractionResult with:

Field	Meaning
`text`	Concatenated full text.
`text_pages`	List of page texts (one string per page).
`tables`	Raw tables as matrices `list[list[list[str]]]`.
`pretty_tables`	Tables rendered as readable ASCII blocks.
`key_value`	Dict of extracted key→values (layout model only).

Examples¶

Path string¶

import textwizard as tw

res = tw.extract_text_azure(
    "invoice.pdf",
    azure_endpoint="https://<resource>.cognitiveservices.azure.com/",
    azure_key="<KEY>",
)
print(res.text)

Bytes (set `extension`)¶

from pathlib import Path
import textwizard as tw

raw = Path("scan.jpg").read_bytes()
res = tw.extract_text_azure(
    raw,
    extension="jpg",
    azure_endpoint="https://<resource>.cognitiveservices.azure.com/",
    azure_key="<KEY>",
)
print(res.text)

Page selection (1-based)¶

import textwizard as tw

# single page
p1 = tw.extract_text_azure("report.pdf", pages=1, azure_endpoint="...", azure_key="...")

# ranges and lists
subset = tw.extract_text_azure(
    "report.pdf",
    pages=[1, 3, "5-7"],
    azure_endpoint="...",
    azure_key="...",
)

Text-only vs Layout (tables + key-value)¶

import textwizard as tw

# Fast, plain text
read = tw.extract_text_azure(
    "doc.pdf",
    azure_model_id="prebuilt-read",
    azure_endpoint="...",
    azure_key="...",
)
print(read.text_pages[:1])

# Layout: text + tables + key-value
layout = tw.extract_text_azure(
    "invoice.pdf",
    azure_model_id="prebuilt-layout",
    azure_endpoint="...",
    azure_key="...",
)
print(layout.pretty_tables)
print(layout.key_value)

Hybrid mode (PDF)¶

import textwizard as tw

res = tw.extract_text_azure(
    "mixed.pdf",
    azure_model_id="prebuilt-layout",
    hybrid=True,                    # native text for text pages, OCR for scanned pages
    azure_endpoint="...",
    azure_key="...",
)
print(len(res.text_pages), "pages")

DOCX with embedded images (OCR per image)¶

import textwizard as tw

docx = tw.extract_text_azure(
    "contract.docx",
    azure_model_id="prebuilt-layout",   # to collect tables/kv from images too
    language_ocr="ita",                 # image OCR locale; pages are 1-based
    pages=[1, 2],
    azure_endpoint="...",
    azure_key="...",
)
print(docx.text_pages[0])
print(docx.tables and docx.pretty_tables)

Operational notes¶

Use prebuilt-read for text extraction.
Use prebuilt-layout for tables and key-value pairs.
Hybrid reduces OCR cost on digitally born PDFs while handling scanned pages.
pages applies to PDF and DOCX (1-based). Ignored for single-image inputs.
Some 3-letter locales (e.g. "eng", "ita") are normalised to ISO-639-1 ("en", "it").
Azure request limits and file size constraints apply; consult the official docs.

Errors¶

Missing/invalid azure_endpoint or azure_key → authentication error.
Unsupported azure_model_id → configuration error.
Unsupported format or unreadable input → validation/I/O error.

Extract Text (Azure)¶

Supported formats¶

Parameters¶

Return value¶

Examples¶

Path string¶

Bytes (set extension)¶

Page selection (1-based)¶

Text-only vs Layout (tables + key-value)¶

Hybrid mode (PDF)¶

DOCX with embedded images (OCR per image)¶

Operational notes¶

Errors¶

See also¶

Bytes (set `extension`)¶