Extract Text (Azure)¶
Cloud OCR and layout extraction via Azure Document Intelligence. Processes scanned and born-digital documents, returning unified text plus optional tables and key-value pairs (with the layout model). Supports selective page processing and a hybrid mode that mixes native extraction with OCR for PDFs.
Note
Requires an Azure resource (endpoint + key). See Azure Document Intelligence.
Supported formats¶
| Format | Notes |
|---|---|
| PDF | OCR or hybrid |
| DOCX | OCR for embedded images |
| JPG/JPEG | OCR |
| PNG | OCR |
| TIF/TIFF | OCR |
| GIF | OCR |
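
A caller may want to reject unsupported inputs before any request is sent. The following is a minimal sketch, assuming the extensions listed in the table above; `SUPPORTED_EXTENSIONS` and `resolve_extension` are hypothetical names for illustration and are not part of TextWizard.

```python
from pathlib import Path

# Extensions taken from the "Supported formats" table; the names here are hypothetical.
SUPPORTED_EXTENSIONS = {"pdf", "docx", "jpg", "jpeg", "png", "tif", "tiff", "gif"}


def resolve_extension(source, extension=None):
    """Return a lower-case extension without the dot, or raise ValueError."""
    if extension is None:
        # Raw bytes carry no file name, so the extension cannot be inferred.
        if isinstance(source, (bytes, bytearray)):
            raise ValueError("extension is required for bytes input")
        extension = Path(source).suffix
    ext = extension.lower().lstrip(".")
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported format: {ext!r}")
    return ext
```

The helper only inspects the file name or the explicit `extension` argument; it does not open the file or verify its contents.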
Parameters¶
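| Parameter | Description |
|---|---|
| (first positional argument) | (str \| bytes \| Path) Path string, bytes, or `Path` object. |
| `extension` | (str, optional) Required if the input is bytes. |
| `language_ocr` | (str, optional) OCR language code (ISO). Default: … |
| `pages` | (int \| str \| list[int\|str] \| None) Page selection (1-based): a single page, a range string such as "5-7", or a list mixing both. |
| `azure_endpoint` | (str) Azure endpoint URL. |
| `azure_key` | (str) Azure API key. |
| `azure_model_id` | (str) Model ID, e.g. "prebuilt-read" (text only) or "prebuilt-layout" (text, tables, key-value pairs). |
| `hybrid` | (bool, optional) PDFs: native text for text pages, OCR for raster pages. Default: … |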
Return value¶
CloudExtractionResult with:
| Field | Meaning |
|---|---|
| `text` | Concatenated full text. |
| `text_pages` | List of page texts (one string per page). |
| `tables` | Raw tables as matrices. |
| `pretty_tables` | Tables rendered as readable ASCII blocks. |
| `key_value` | Dict of extracted key→values (layout model only). |
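
To illustrate how a raw table matrix relates to a readable ASCII block, here is a minimal sketch. It assumes each table is a list of rows, each row a list of cell values; the exact matrix shape and the exact layout of the library's rendered tables are not specified on this page, so `render_ascii_table` is a hypothetical helper and its output may differ from what the library produces.

```python
def render_ascii_table(matrix):
    """Render a list of rows (each a list of cell values) as an ASCII grid."""
    if not matrix:
        return ""
    width = max(len(row) for row in matrix)
    # Pad short rows so every row has the same number of cells.
    rows = [[str(c) for c in row] + [""] * (width - len(row)) for row in matrix]
    col_widths = [max(len(r[i]) for r in rows) for i in range(width)]
    border = "+" + "+".join("-" * (w + 2) for w in col_widths) + "+"
    lines = [border]
    for r in rows:
        cells = " | ".join(c.ljust(w) for c, w in zip(r, col_widths))
        lines.append("| " + cells + " |")
        lines.append(border)
    return "\n".join(lines)


print(render_ascii_table([["Item", "Qty"], ["Pen", "2"]]))
```

Working from the raw matrices rather than the rendered block is usually more convenient when the tables feed further processing, such as export to CSV.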
Examples¶
Path string¶
import textwizard as tw
res = tw.extract_text_azure(
    "invoice.pdf",
    azure_endpoint="https://<resource>.cognitiveservices.azure.com/",
    azure_key="<KEY>",
)
print(res.text)
Bytes (set extension)¶
from pathlib import Path
import textwizard as tw
raw = Path("scan.jpg").read_bytes()
res = tw.extract_text_azure(
    raw,
    extension="jpg",  # bytes carry no file name, so the format must be given
    azure_endpoint="https://<resource>.cognitiveservices.azure.com/",
    azure_key="<KEY>",
)
print(res.text)
Page selection (1-based)¶
import textwizard as tw
# single page
p1 = tw.extract_text_azure("report.pdf", pages=1, azure_endpoint="...", azure_key="...")
# ranges and lists
subset = tw.extract_text_azure(
    "report.pdf",
    pages=[1, 3, "5-7"],
    azure_endpoint="...",
    azure_key="...",
)
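
The page specification above mixes integers and range strings. As a rough illustration of what such a spec denotes, the sketch below expands it into an explicit sorted list of 1-based page numbers. `expand_pages` is a hypothetical helper, not a TextWizard function, and the handling of `None` and of invalid values is an assumption rather than the library's documented behaviour.

```python
def expand_pages(pages):
    """Expand an int, an "a-b" string, or a list of both into sorted 1-based ints."""
    if pages is None:
        return None  # assumption: None means "no explicit selection"
    items = pages if isinstance(pages, (list, tuple)) else [pages]
    selected = set()
    for item in items:
        text = str(item).strip()
        if "-" in text:
            start_s, end_s = text.split("-", 1)
            start, end = int(start_s), int(end_s)
        else:
            start = end = int(text)
        if start < 1 or end < start:
            raise ValueError(f"invalid page spec: {item!r}")
        selected.update(range(start, end + 1))
    return sorted(selected)


print(expand_pages([1, 3, "5-7"]))
```

Duplicates and overlapping ranges collapse naturally because the pages are collected in a set before sorting.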
Text-only vs Layout (tables + key-value)¶
import textwizard as tw
# Fast, plain text
read = tw.extract_text_azure(
    "doc.pdf",
    azure_model_id="prebuilt-read",
    azure_endpoint="...",
    azure_key="...",
)
print(read.text_pages[:1])
# Layout: text + tables + key-value
layout = tw.extract_text_azure(
    "invoice.pdf",
    azure_model_id="prebuilt-layout",
    azure_endpoint="...",
    azure_key="...",
)
print(layout.pretty_tables)
print(layout.key_value)
Hybrid mode (PDF)¶
import textwizard as tw
res = tw.extract_text_azure(
    "mixed.pdf",
    azure_model_id="prebuilt-layout",
    hybrid=True,  # native text for text pages, OCR for scanned pages
    azure_endpoint="...",
    azure_key="...",
)
print(len(res.text_pages), "pages")
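
Conceptually, hybrid mode makes a per-page decision: keep the native text where a page has any, and send only the remaining pages to OCR. The sketch below illustrates that idea under simple assumptions (a page counts as raster when its native text is empty or whitespace). It is not the library's implementation; `hybrid_extract` and the `ocr_page` callback are hypothetical names.

```python
def hybrid_extract(native_pages, ocr_page):
    """Return one text per page, calling ocr_page(index) only for pages without native text.

    native_pages: list of str or None, one entry per page (0-based index here).
    ocr_page: callable taking the page index and returning the OCR text.
    """
    texts = []
    for index, native in enumerate(native_pages):
        if native and native.strip():
            texts.append(native)  # text page: keep the native extraction
        else:
            texts.append(ocr_page(index))  # raster page: fall back to OCR
    return texts


# Stand-in OCR callback so the sketch runs without any cloud service.
pages = hybrid_extract(["Hello", "", None], lambda i: f"<ocr page {i}>")
print(pages)
```

Because the OCR callback runs only for pages lacking native text, a mostly born-digital PDF triggers few OCR calls.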
DOCX with embedded images (OCR per image)¶
import textwizard as tw
docx = tw.extract_text_azure(
    "contract.docx",
    azure_model_id="prebuilt-layout",  # to collect tables/kv from images too
    language_ocr="ita",                # image OCR locale
    pages=[1, 2],                      # pages are 1-based
    azure_endpoint="...",
    azure_key="...",
)
print(docx.text_pages[0])
if docx.tables:  # print rendered tables only when some were found
    print(docx.pretty_tables)
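
The example above passes a 3-letter language code ("ita"). This page notes that some 3-letter locales are normalised to ISO-639-1; the sketch below shows one way such a mapping could look. Only the "eng" and "ita" pairs come from this page; the other entries, the table name, and `normalise_locale` are illustrative assumptions, and the library's actual mapping is not shown here.

```python
# "eng" and "ita" are taken from this page; the remaining pairs are illustrative.
THREE_TO_TWO_LETTER = {
    "eng": "en",
    "ita": "it",
    "fra": "fr",
    "deu": "de",
    "spa": "es",
}


def normalise_locale(code):
    """Map a known 3-letter code to its 2-letter form; pass anything else through."""
    key = code.strip().lower()
    return THREE_TO_TWO_LETTER.get(key, key)


print(normalise_locale("ita"))
```

Unknown codes are returned unchanged (lower-cased), which leaves the final decision about validity to the service.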
Operational notes¶
- Use `prebuilt-read` for text extraction.
- Use `prebuilt-layout` for tables and key-value pairs.
- Hybrid mode reduces OCR cost on born-digital PDFs while still handling scanned pages.
- `pages` applies to PDF and DOCX (1-based). It is ignored for single-image inputs.
- Some 3-letter locales (e.g. "eng", "ita") are normalised to ISO-639-1 ("en", "it").
- Azure request limits and file size constraints apply; consult the official docs.
Errors¶
- Missing or invalid `azure_endpoint` or `azure_key` → authentication error.
- Unsupported `azure_model_id` → configuration error.
- Unsupported format or unreadable input → validation/I/O error.
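
Since the exception classes are not named on this page, one cautious option is to check the credentials locally before calling the service, so obvious mistakes fail fast with a clear message. The following is a minimal sketch; `check_azure_config` is a hypothetical helper, and the requirement that the endpoint be an `https` URL is an assumption based on the endpoint format used in the examples.

```python
from urllib.parse import urlparse


def check_azure_config(azure_endpoint, azure_key):
    """Raise ValueError early when the endpoint or key is obviously unusable."""
    parsed = urlparse(azure_endpoint or "")
    if parsed.scheme != "https" or not parsed.netloc:
        raise ValueError("azure_endpoint must be an https URL")
    if not azure_key or not azure_key.strip():
        raise ValueError("azure_key is missing")


# Passes silently for well-formed values; the key itself is not verified here.
check_azure_config("https://example.cognitiveservices.azure.com/", "not-a-real-key")
```

A local check of this kind cannot detect a wrong or expired key; that can only surface as an authentication error from the service.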
See also¶
Extract Text — Local extraction with optional Tesseract OCR
TextWizard — Overview and quick start