====================
Extract Text (Azure)
====================
Cloud OCR and layout extraction via **Azure Document Intelligence**.
Processes scanned and born-digital documents, returning unified text plus optional **tables** and **key-value pairs** (with the *layout* model).
Supports selective **page** processing and a **hybrid** mode that mixes native extraction with OCR for PDFs.
.. note::

   Requires an Azure resource (endpoint + key). See the
   Azure Document Intelligence documentation.
Supported formats
=================
+-----------+----------------------------+
| Format    | Notes                      |
+===========+============================+
| PDF       | OCR or hybrid              |
+-----------+----------------------------+
| DOCX      | OCR for embedded images    |
+-----------+----------------------------+
| JPG/JPEG  | OCR                        |
+-----------+----------------------------+
| PNG       | OCR                        |
+-----------+----------------------------+
| TIF/TIFF  | OCR                        |
+-----------+----------------------------+
| GIF       | OCR                        |
+-----------+----------------------------+
Parameters
==========
+---------------------+--------------------------------------------------------------------------+
| **Parameter**       | **Description**                                                          |
+=====================+==========================================================================+
| ``input_data``      | (*str | bytes | Path*) Path string, bytes, or ``pathlib.Path``.          |
+---------------------+--------------------------------------------------------------------------+
| ``extension``       | (*str, optional*) Required if ``input_data`` is bytes (e.g. ``"pdf"``).  |
+---------------------+--------------------------------------------------------------------------+
| ``language_ocr``    | (*str, optional*) OCR language code (ISO). Default ``"eng"``.            |
+---------------------+--------------------------------------------------------------------------+
| ``pages``           | (*int | str | list[int|str] | None*) Page selection: ``1``,              |
|                     | ``"1,3,5-7"`` or mixed list ``[1, 3, "5-7"]``. **1-based.**              |
+---------------------+--------------------------------------------------------------------------+
| ``azure_endpoint``  | (*str*) Azure endpoint URL.                                              |
+---------------------+--------------------------------------------------------------------------+
| ``azure_key``       | (*str*) Azure API key.                                                   |
+---------------------+--------------------------------------------------------------------------+
| ``azure_model_id``  | (*str*) ``"prebuilt-read"`` for text only; ``"prebuilt-layout"`` adds    |
|                     | tables and key-value pairs.                                              |
+---------------------+--------------------------------------------------------------------------+
| ``hybrid``          | (*bool, optional*) PDFs: native text for text pages, OCR for raster.     |
|                     | Default ``False``.                                                       |
+---------------------+--------------------------------------------------------------------------+
Return value
============
``CloudExtractionResult`` with:
+-------------------+-----------------------------------------------------------+
| Field             | Meaning                                                   |
+===================+===========================================================+
| ``text``          | Concatenated full text.                                   |
+-------------------+-----------------------------------------------------------+
| ``text_pages``    | List of page texts (one string per page).                 |
+-------------------+-----------------------------------------------------------+
| ``tables``        | Raw tables as matrices ``list[list[list[str]]]``.         |
+-------------------+-----------------------------------------------------------+
| ``pretty_tables`` | Tables rendered as readable ASCII blocks.                 |
+-------------------+-----------------------------------------------------------+
| ``key_value``     | Dict of extracted key→values (layout model only).         |
+-------------------+-----------------------------------------------------------+
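
The raw ``tables`` field lends itself to post-processing with the standard library. The following is a minimal sketch, assuming each table is a list of rows and each row a list of cell strings, as the type above suggests. ``table_to_csv`` is a hypothetical helper, not part of textwizard, and the sample data is made up.

```python
import csv
import io


def table_to_csv(table: list[list[str]]) -> str:
    """Serialise one table (rows of cell strings) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(table)
    return buffer.getvalue()


# Stand-in for ``res.tables``: one table with a header row and one data row.
sample_tables = [[["Item", "Qty"], ["Widget, large", "2"]]]

for table in sample_tables:
    print(table_to_csv(table))
```

Cells that contain commas are quoted by ``csv.writer``, so the output can be loaded back without ambiguity.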
Examples
========
Path string
-----------
.. code-block:: python

   import textwizard as tw

   res = tw.extract_text_azure(
       "invoice.pdf",
       azure_endpoint="https://<your-resource>.cognitiveservices.azure.com/",
       azure_key="<your-key>",
   )
   print(res.text)
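
To keep the key out of source files, the credentials can be read from the environment. This is a sketch under stated assumptions: the variable names ``AZURE_DI_ENDPOINT`` and ``AZURE_DI_KEY`` and the helper ``load_azure_credentials`` are hypothetical, and textwizard is not known to read them itself.

```python
import os


def load_azure_credentials(env=None) -> dict:
    """Read endpoint and key from environment variables (names are hypothetical)."""
    env = os.environ if env is None else env
    endpoint = env.get("AZURE_DI_ENDPOINT", "").strip()
    key = env.get("AZURE_DI_KEY", "").strip()
    if not endpoint or not key:
        raise RuntimeError("set AZURE_DI_ENDPOINT and AZURE_DI_KEY first")
    return {"azure_endpoint": endpoint, "azure_key": key}


# Demo with an explicit mapping so the sketch runs without real credentials.
creds = load_azure_credentials(
    {"AZURE_DI_ENDPOINT": "https://example.invalid/", "AZURE_DI_KEY": "not-a-real-key"}
)
print(sorted(creds))
```

The returned dict uses the documented keyword names, so it can be unpacked into the call, for example ``tw.extract_text_azure("invoice.pdf", **creds)``.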
Bytes (set ``extension``)
-------------------------
.. code-block:: python

   from pathlib import Path

   import textwizard as tw

   raw = Path("scan.jpg").read_bytes()
   res = tw.extract_text_azure(
       raw,
       extension="jpg",
       azure_endpoint="https://<your-resource>.cognitiveservices.azure.com/",
       azure_key="<your-key>",
   )
   print(res.text)
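
When the bytes arrive without a file name, the ``extension`` argument has to come from somewhere. One option is to sniff the leading magic bytes. The sketch below is illustrative only: ``guess_extension`` is a hypothetical helper, not a textwizard function, and the ZIP signature cannot tell DOCX apart from other ZIP-based formats.

```python
def guess_extension(raw: bytes):
    """Guess a file extension from leading magic bytes, or return None."""
    signatures = [
        (b"%PDF", "pdf"),
        (b"\x89PNG\r\n\x1a\n", "png"),
        (b"\xff\xd8\xff", "jpg"),
        (b"GIF87a", "gif"),
        (b"GIF89a", "gif"),
        (b"II*\x00", "tif"),  # little-endian TIFF
        (b"MM\x00*", "tif"),  # big-endian TIFF
        (b"PK\x03\x04", "docx"),  # ZIP container: DOCX is only one possibility
    ]
    for magic, ext in signatures:
        if raw.startswith(magic):
            return ext
    return None


print(guess_extension(b"%PDF-1.7\n"))
print(guess_extension(b"not a known format"))
```

A ``None`` result is a signal to ask the caller for the extension rather than to guess.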
Page selection (1-based)
------------------------
.. code-block:: python

   import textwizard as tw

   # single page
   p1 = tw.extract_text_azure("report.pdf", pages=1, azure_endpoint="...", azure_key="...")

   # ranges and lists
   subset = tw.extract_text_azure(
       "report.pdf",
       pages=[1, 3, "5-7"],
       azure_endpoint="...",
       azure_key="...",
   )
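
To make the accepted ``pages`` forms concrete, the sketch below expands a spec into explicit 1-based page numbers. It is an illustration of the documented syntax, not the library's parser: ``expand_pages`` is a hypothetical name, and treating ``None`` as "all pages" is an assumption.

```python
def expand_pages(pages):
    """Expand a page spec into a sorted list of unique 1-based page numbers."""
    if pages is None:
        return None  # assumed to mean "all pages"
    items = pages if isinstance(pages, (list, tuple)) else [pages]
    selected = set()
    for item in items:
        for part in str(item).split(","):
            part = part.strip()
            if not part:
                continue
            if "-" in part:
                start, end = (int(p) for p in part.split("-", 1))
            else:
                start = end = int(part)
            if start < 1 or end < start:
                raise ValueError(f"invalid page range: {part!r}")
            selected.update(range(start, end + 1))
    return sorted(selected)


print(expand_pages(1))
print(expand_pages("1,3,5-7"))
print(expand_pages([1, 3, "5-7"]))
```

All three documented forms reduce to the same kind of list, which makes it easy to check a selection before sending a request.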
Text-only vs Layout (tables + key-value)
----------------------------------------
.. code-block:: python

   import textwizard as tw

   # Fast, plain text
   read = tw.extract_text_azure(
       "doc.pdf",
       azure_model_id="prebuilt-read",
       azure_endpoint="...",
       azure_key="...",
   )
   print(read.text_pages[:1])

   # Layout: text + tables + key-value
   layout = tw.extract_text_azure(
       "invoice.pdf",
       azure_model_id="prebuilt-layout",
       azure_endpoint="...",
       azure_key="...",
   )
   print(layout.pretty_tables)
   print(layout.key_value)
Hybrid mode (PDF)
-----------------
.. code-block:: python

   import textwizard as tw

   res = tw.extract_text_azure(
       "mixed.pdf",
       azure_model_id="prebuilt-layout",
       hybrid=True,  # native text for text pages, OCR for scanned pages
       azure_endpoint="...",
       azure_key="...",
   )
   print(len(res.text_pages), "pages")
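
Conceptually, hybrid mode is a per-page decision: keep the embedded text where a page has any, and send only the remaining pages to OCR. The sketch below models that idea with stand-ins. ``hybrid_page_texts`` and ``fake_ocr`` are hypothetical names, and the "non-empty text" test is an assumed heuristic, not textwizard's actual rule.

```python
from typing import Callable, Optional


def hybrid_page_texts(
    native_texts: list[Optional[str]],
    ocr_page: Callable[[int], str],
) -> list[str]:
    """Use native text where a page has any, otherwise fall back to OCR."""
    results = []
    for page_number, native in enumerate(native_texts, start=1):
        if native and native.strip():
            results.append(native)
        else:
            results.append(ocr_page(page_number))
    return results


# Page 2 has no embedded text, so only that page goes to the OCR stand-in.
ocr_calls = []


def fake_ocr(page_number: int) -> str:
    ocr_calls.append(page_number)
    return f"<ocr text of page {page_number}>"


pages = hybrid_page_texts(["Intro text", "", "Closing text"], fake_ocr)
print(pages)
print("OCR requested for pages:", ocr_calls)
```

Only one of the three pages triggers an OCR call here, which is where the cost saving on mostly born-digital PDFs comes from.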
DOCX with embedded images (OCR per image)
-----------------------------------------
.. code-block:: python

   import textwizard as tw

   docx = tw.extract_text_azure(
       "contract.docx",
       azure_model_id="prebuilt-layout",  # to collect tables/kv from images too
       language_ocr="ita",                # image OCR locale
       pages=[1, 2],                      # pages are 1-based
       azure_endpoint="...",
       azure_key="...",
   )
   print(docx.text_pages[0])
   # Print the rendered tables only when at least one table was found
   if docx.tables:
       print(docx.pretty_tables)
Operational notes
=================
- Use ``prebuilt-read`` for text extraction.
- Use ``prebuilt-layout`` for tables and key-value pairs.
- **Hybrid** mode reduces OCR cost on born-digital PDFs while still handling scanned pages.
- ``pages`` applies to **PDF** and **DOCX** (1-based). Ignored for single-image inputs.
- Some three-letter language codes (e.g. ``"eng"``, ``"ita"``) are normalised to their ISO 639-1 equivalents (``"en"``, ``"it"``).
- Azure request limits and file size constraints apply; consult the official docs.
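
The language-code normalisation mentioned above can be pictured as a small lookup. This is a sketch with an illustrative subset of codes. ``normalize_locale`` and the mapping are hypothetical, and the set of codes the library actually converts is not documented here.

```python
# Illustrative subset only: three-letter codes and their ISO 639-1 equivalents.
THREE_TO_TWO_LETTER = {
    "eng": "en",
    "ita": "it",
    "fra": "fr",
    "deu": "de",
    "spa": "es",
}


def normalize_locale(code: str) -> str:
    """Map a known three-letter code to two letters; pass anything else through."""
    cleaned = code.strip().lower()
    return THREE_TO_TWO_LETTER.get(cleaned, cleaned)


print(normalize_locale("eng"))
print(normalize_locale("ITA"))
print(normalize_locale("en"))
```

Codes that are not in the table pass through unchanged, which mirrors the "some" in the note: not every three-letter code is necessarily converted.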
Errors
======
- Missing/invalid ``azure_endpoint`` or ``azure_key`` → authentication error.
- Unsupported ``azure_model_id`` → configuration error.
- Unsupported format or unreadable input → validation/I/O error.
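
Several of these failures can be caught before any request is sent. The following pre-flight check is a sketch under stated assumptions: ``check_request`` is a hypothetical helper, the https requirement is assumed, and the model and extension sets are taken from the tables on this page. The exception types textwizard raises are not specified here, so the helper returns messages instead of raising.

```python
from urllib.parse import urlparse

# Values taken from this page; the library may accept more.
SUPPORTED_MODELS = {"prebuilt-read", "prebuilt-layout"}
SUPPORTED_EXTENSIONS = {"pdf", "docx", "jpg", "jpeg", "png", "tif", "tiff", "gif"}


def check_request(input_data, azure_endpoint, azure_key, azure_model_id, extension=None):
    """Return a list of problems found; an empty list means the call looks sane."""
    problems = []
    parsed = urlparse(azure_endpoint or "")
    if parsed.scheme != "https" or not parsed.netloc:
        problems.append("azure_endpoint must be an https URL")
    if not azure_key:
        problems.append("azure_key is empty")
    if azure_model_id not in SUPPORTED_MODELS:
        problems.append(f"unsupported azure_model_id: {azure_model_id!r}")
    if isinstance(input_data, bytes):
        if not extension:
            problems.append("extension is required when input_data is bytes")
        elif extension.lower().lstrip(".") not in SUPPORTED_EXTENSIONS:
            problems.append(f"unsupported extension: {extension!r}")
    return problems


issues = check_request(b"%PDF-1.7", "", "", "prebuilt-invoice")
for issue in issues:
    print("-", issue)
```

Collecting every problem in one pass lets a caller report them all at once rather than fixing them one failed request at a time.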
See also
========
- :doc:`extract_text` — Local extraction with optional Tesseract OCR
- :doc:`intro` — Overview and quick start