==============================
Named-Entity Recognition (NER)
==============================

Extract entities such as persons, organizations, locations, dates, and more from raw text.
Backends: **spaCy**, **Stanza**, and **spaCy-Stanza**. Returns a structured
:class:`EntitiesResult` object with convenient accessors.

.. note::
   Install NER extras first:

   .. code-block:: bash

      pip install "textwizard[ner]"
      # Example spaCy model:
      python -m spacy download en_core_web_sm

Overview
========

- **Engines**
  - ``"spacy"`` – fastest startup and inference; uses spaCy pipelines.
  - ``"stanza"`` – often stronger accuracy for some languages; slower init.
  - ``"spacy_stanza"`` – spaCy tokenizer + Stanza NER.
- **Device selection**
  - ``device="auto"`` uses GPU if available, else CPU.
  - ``"gpu"`` requires CUDA; raises if unavailable.
  - ``"cpu"`` forces CPU.
- **Models**
  - spaCy: pass a model name (e.g., ``en_core_web_sm``) or an absolute path.
  - Stanza: pass ISO language code (e.g., ``"en"``, ``"it"``).
- **Auto-download**
  - Missing models are downloaded automatically.

Parameters
==========

.. list-table::
   :header-rows: 1
   :widths: 26 74

   * - **Parameter**
     - **Description**
   * - ``text``
     - ``str``. Non-empty Unicode string to analyze.
   * - ``engine``
     - ``'spacy' | 'stanza' | 'spacy_stanza'``. Default ``"spacy"``.
   * - ``model``
     - spaCy model name or absolute path. Used only when ``engine="spacy"``. Default ``"en_core_web_sm"``.
   * - ``language``
     - ISO code for Stanza / spaCy-Stanza (e.g., ``"en"``, ``"it"``). Default ``"en"``.
   * - ``device``
     - ``"auto" | "cpu" | "gpu"``. Default ``"auto"``.

Return value
============

``EntitiesResult`` with:

- ``entities``: ``Dict[str, List[Entity]]`` grouped by label. Example keys:
  ``"PERSON"``, ``"ORG"``, ``"GPE"``, ``"DATE"``, … (depends on the model).
- ``full_analysis``: ``Dict[int, TokenAnalysis]`` per token (lemma, POS, dep, offsets, ent type).
- Helper methods:
  - ``labels`` → ``List[str]``
  - ``counts`` → ``Dict[str, int]``
  - ``get(label)`` → ``List[Entity]``
  - ``to_dicts()`` → ``List[dict]``
  - ``most_common(n=5)`` → ``List[Entity]``


Examples
========

Basic usage (spaCy, English)
----------------------------

.. code-block:: python

   import textwizard as tw

   sample = (
       "Alex Rivera traveled to Springfield to meet the team at Northstar Analytics "
       "on 14 March 2025. The next day he met Horizon Bank."
   )
   res = tw.extract_entities(sample)

   # Access groups
   persons = [e.text for e in res.entities.get("PERSON", [])]
   orgs    = [e.text for e in res.entities.get("ORG", [])]
   gpe     = [e.text for e in res.entities.get("GPE", [])]

   print(res.labels)     # e.g. ['PERSON', 'GPE', 'ORG', 'DATE']
   print(res.counts)     # e.g. {'PERSON': 1, 'GPE': 1, 'ORG': 2, 'DATE': 1}
   print(persons, orgs, gpe)


**Output**  

   .. code-block:: text

    ['PERSON', 'GPE', 'ORG', 'DATE']
    {'PERSON': 1, 'GPE': 1, 'ORG': 2, 'DATE': 2}
    ['Alex Rivera'] ['Northstar Analytics', 'Horizon Bank'] ['Springfield']


Switch engine / model
---------------------

.. code-block:: python

   import textwizard as tw

   # Stanza (Italian), CPU
   ita = tw.extract_entities(
       "Mario Rossi è nato a Milano nel 1980.",
       engine="stanza", language="it", device="cpu"
   )

   # spaCy with a larger English model
   res_lg = tw.extract_entities(
       "Mario Rossi visited Paris.",
       engine="spacy", model="en_core_web_trf", device="gpu"   # transformer on GPU if available
   )

   # spaCy-Stanza hybrid on GPU (English)
   hybrid = tw.extract_entities(
       "OpenAI is based in San Francisco.",
       engine="spacy_stanza", language="en", device="cpu"
   )

Use absolute path to a spaCy model
----------------------------------

.. code-block:: python

   import textwizard as tw
   from pathlib import Path

   custom_model = Path("/models/en_core_web_sm")
   res = tw.extract_entities("Custom pipeline run.", engine="spacy", model=str(custom_model))

Consume EntitiesResult
----------------------

.. code-block:: python

  
 import textwizard as tw

   text = "Tim Cook met Satya Nadella in Seattle on 2024-05-18."
   res = tw.extract_entities(text)

   # Flatten to list[dict] for JSON export
   payload = res.to_dicts()
   # Most common surface forms
   top = [e.text for e in res.most_common(3)]
   # Iterate labels
   for label, ents in res:
       print(label, [e.text for e in ents])
       
       
**Output**  

   .. code-block:: text

    PERSON ['Tim Cook', 'Satya Nadella']
    GPE ['Seattle']
    DATE ['2024-05-18']


Labels and coverage
===================

Entity labels depend on the chosen model. Common labels include:
``PERSON``, ``ORG``, ``GPE``, ``LOC``, ``DATE``, ``TIME``, ``NORP``, ``LAW``, ``MONEY``,
``PERCENT``, ``EVENT``, ``WORK_OF_ART``, ``FAC``, ``PRODUCT``. Availability varies per language/model.

Errors
======

- Empty or non-string ``text`` → validation error.
- Unsupported ``engine`` or ``device`` → ``ValueError``.
- Missing libraries/models → ``RuntimeError`` with installation hint.

See also
========

- :doc:`lang_detect` — Language detection for routing to the right model
- :doc:`intro` — Overview and quick start