Named-Entity Recognition (NER)

Extract entities such as persons, organizations, locations, dates, and more from raw text. Backends: spaCy, Stanza, and spaCy-Stanza. Returns a structured EntitiesResult object with convenient accessors.

Note

Install NER extras first:

pip install "textwizard[ner]"
# Example spaCy model:
python -m spacy download en_core_web_sm

Overview

  • Engines
      - "spacy" – fastest startup and inference; uses spaCy pipelines.
      - "stanza" – often stronger accuracy for some languages; slower init.
      - "spacy_stanza" – spaCy tokenizer + Stanza NER.

  • Device selection
      - device="auto" uses the GPU if available, else the CPU.
      - "gpu" requires CUDA; raises if unavailable.
      - "cpu" forces CPU.

  • Models
      - spaCy: pass a model name (e.g., en_core_web_sm) or an absolute path.
      - Stanza: pass an ISO language code (e.g., "en", "it").

  • Auto-download
      - Missing models are downloaded automatically.
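
The device rules above can be sketched as a small decision function. This is an illustration of the documented behavior only, not the library's code: resolve_device and its cuda_available argument are hypothetical names, and the exception type raised for an unavailable GPU is an assumption (the text only says that "gpu" raises).

```python
def resolve_device(device: str, cuda_available: bool) -> str:
    """Map a requested device to the one that would be used (hypothetical helper)."""
    if device == "cpu":
        return "cpu"
    if device == "gpu":
        if not cuda_available:
            # Documented: "gpu" raises when CUDA is missing.
            # Assumption: the concrete exception type is not documented.
            raise RuntimeError("device='gpu' requested but CUDA is not available")
        return "gpu"
    if device == "auto":
        # Documented: prefer the GPU, fall back to the CPU.
        return "gpu" if cuda_available else "cpu"
    # Documented: an unsupported device raises ValueError.
    raise ValueError(f"unsupported device: {device!r}")


print(resolve_device("auto", cuda_available=False))  # cpu
print(resolve_device("auto", cuda_available=True))   # gpu
```

In practice, "auto" is the safer choice for code that must run on machines with and without a GPU, while "gpu" is useful when a silent CPU fallback would be a problem.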

Parameters

Parameter   Description
---------   -----------
text        str. Non-empty Unicode string to analyze.
engine      "spacy" | "stanza" | "spacy_stanza". Default "spacy".
model       spaCy model name or absolute path. Used only when engine="spacy". Default "en_core_web_sm".
language    ISO code for Stanza / spaCy-Stanza (e.g., "en", "it"). Default "en".
device      "auto" | "cpu" | "gpu". Default "auto".

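Because model applies only to the spaCy engine and language only to Stanza and spaCy-Stanza, it can help to assemble the keyword arguments in one place. The sketch below assumes the parameter names and defaults listed above; build_ner_kwargs is a hypothetical helper, not part of textwizard.

```python
from typing import Dict

ENGINES = {"spacy", "stanza", "spacy_stanza"}
DEVICES = {"auto", "cpu", "gpu"}


def build_ner_kwargs(
    engine: str = "spacy",
    model: str = "en_core_web_sm",
    language: str = "en",
    device: str = "auto",
) -> Dict[str, str]:
    """Return only the keyword arguments that matter for the chosen engine."""
    if engine not in ENGINES:
        raise ValueError(f"unsupported engine: {engine!r}")
    if device not in DEVICES:
        raise ValueError(f"unsupported device: {device!r}")
    kwargs = {"engine": engine, "device": device}
    if engine == "spacy":
        kwargs["model"] = model        # model is used only by the spaCy engine
    else:
        kwargs["language"] = language  # ISO code for Stanza / spaCy-Stanza
    return kwargs


print(build_ner_kwargs(engine="stanza", language="it", device="cpu"))
# {'engine': 'stanza', 'device': 'cpu', 'language': 'it'}

# Intended use (requires textwizard and the relevant models):
# import textwizard as tw
# res = tw.extract_entities("Mario Rossi è nato a Milano.", **build_ner_kwargs("stanza", language="it"))
```
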
Return value

EntitiesResult with:

  • entities: Dict[str, List[Entity]] grouped by label. Example keys: "PERSON", "ORG", "GPE", "DATE", … (depends on the model).

  • full_analysis: Dict[int, TokenAnalysis] per token (lemma, POS, dep, offsets, ent type).

  • Helper methods:
      - labels → List[str]
      - counts → Dict[str, int]
      - get(label) → List[Entity]
      - to_dicts() → List[dict]
      - most_common(n=5) → List[Entity]
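
To make the relationship between these accessors concrete, here is a sketch of how labels and counts could be derived from a mapping shaped like entities. It is one plausible reading that matches the example output in this section, not the library's implementation; summarize is a hypothetical function and SimpleNamespace objects stand in for Entity.

```python
from types import SimpleNamespace
from typing import Dict, List, Tuple


def summarize(entities: Dict[str, list]) -> Tuple[List[str], Dict[str, int]]:
    """Derive label names and per-label counts from a label -> entities mapping."""
    labels = list(entities)  # label names, in insertion order
    counts = {label: len(items) for label, items in entities.items()}
    return labels, counts


# Stand-in data; a real result would come from tw.extract_entities(...).entities
fake_entities = {
    "PERSON": [SimpleNamespace(text="Tim Cook"), SimpleNamespace(text="Satya Nadella")],
    "GPE": [SimpleNamespace(text="Seattle")],
}

labels, counts = summarize(fake_entities)
print(labels)  # ['PERSON', 'GPE']
print(counts)  # {'PERSON': 2, 'GPE': 1}
```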

Examples

Basic usage (spaCy, English)

import textwizard as tw

sample = (
    "Alex Rivera traveled to Springfield to meet the team at Northstar Analytics "
    "on 14 March 2025. The next day he met Horizon Bank."
)
res = tw.extract_entities(sample)

# Access groups
persons = [e.text for e in res.entities.get("PERSON", [])]
orgs    = [e.text for e in res.entities.get("ORG", [])]
gpe     = [e.text for e in res.entities.get("GPE", [])]

print(res.labels)     # e.g. ['PERSON', 'GPE', 'ORG', 'DATE']
print(res.counts)     # e.g. {'PERSON': 1, 'GPE': 1, 'ORG': 2, 'DATE': 2}
print(persons, orgs, gpe)

Output

['PERSON', 'GPE', 'ORG', 'DATE']
{'PERSON': 1, 'GPE': 1, 'ORG': 2, 'DATE': 2}
['Alex Rivera'] ['Northstar Analytics', 'Horizon Bank'] ['Springfield']

Switch engine / model

import textwizard as tw

# Stanza (Italian), CPU
ita = tw.extract_entities(
    "Mario Rossi è nato a Milano nel 1980.",
    engine="stanza", language="it", device="cpu"
)

# spaCy with a larger English model
res_lg = tw.extract_entities(
    "Mario Rossi visited Paris.",
    engine="spacy", model="en_core_web_trf", device="gpu"   # transformer model; "gpu" raises if CUDA is unavailable
)

# spaCy-Stanza hybrid on CPU (English)
hybrid = tw.extract_entities(
    "OpenAI is based in San Francisco.",
    engine="spacy_stanza", language="en", device="cpu"
)

Use absolute path to a spaCy model

import textwizard as tw
from pathlib import Path

custom_model = Path("/models/en_core_web_sm")
res = tw.extract_entities("Custom pipeline run.", engine="spacy", model=str(custom_model))

Consume EntitiesResult

import textwizard as tw

text = "Tim Cook met Satya Nadella in Seattle on 2024-05-18."
res = tw.extract_entities(text)

# Flatten to list[dict] for JSON export
payload = res.to_dicts()
# Most common surface forms
top = [e.text for e in res.most_common(3)]
# Iterate labels
for label, ents in res:
    print(label, [e.text for e in ents])

Output

PERSON ['Tim Cook', 'Satya Nadella']
GPE ['Seattle']
DATE ['2024-05-18']
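
The JSON export mentioned in the comments above might look like the following sketch. It assumes that the dicts returned by to_dicts() contain only JSON-serializable values, which this page does not state; entities_to_json is a hypothetical helper.

```python
import json


def entities_to_json(result, indent: int = 2) -> str:
    """Serialize any object exposing to_dicts() to a JSON string."""
    # ensure_ascii=False keeps accented characters (e.g. "è") readable.
    return json.dumps(result.to_dicts(), ensure_ascii=False, indent=indent)


# Intended use (requires textwizard and a spaCy model):
# import textwizard as tw
# res = tw.extract_entities("Tim Cook met Satya Nadella in Seattle on 2024-05-18.")
# with open("entities.json", "w", encoding="utf-8") as fh:
#     fh.write(entities_to_json(res))
```

If serialization fails with a TypeError, the dicts contain a non-JSON type; passing default=str to json.dumps is a common workaround.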

Labels and coverage

Entity labels depend on the chosen model. Common labels include: PERSON, ORG, GPE, LOC, DATE, TIME, NORP, LAW, MONEY, PERCENT, EVENT, WORK_OF_ART, FAC, PRODUCT. Availability varies per language/model.
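
Since a label that one model emits may be absent from another, downstream code is safer when it looks labels up defensively. The sketch below works on any mapping shaped like entities; texts_for is a hypothetical helper and SimpleNamespace objects stand in for Entity.

```python
from types import SimpleNamespace
from typing import Dict, Iterable, List


def texts_for(entities: Dict[str, list], wanted: Iterable[str]) -> Dict[str, List[str]]:
    """Collect surface forms for the requested labels; missing labels yield []."""
    return {label: [e.text for e in entities.get(label, [])] for label in wanted}


# Stand-in for a model that emits GPE but not LOC
entities = {"GPE": [SimpleNamespace(text="Springfield")]}
print(texts_for(entities, ["GPE", "LOC"]))
# {'GPE': ['Springfield'], 'LOC': []}
```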

Errors

  • Empty or non-string text → validation error.

  • Unsupported engine or device → ValueError.

  • Missing libraries/models → RuntimeError with installation hint.
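
One way to handle these cases in calling code is a thin wrapper, sketched below. The exception type for invalid text is not named on this page, so the wrapper checks the text itself before calling; treating whitespace-only text as empty is a choice made here, not documented behavior. safe_extract is a hypothetical helper that takes the extraction function as an argument.

```python
from typing import Any, Callable, Optional


def safe_extract(extract: Callable[..., Any], text: Any, **kwargs: Any) -> Optional[Any]:
    """Call extract(text, **kwargs) and return None instead of raising."""
    # Pre-check locally: the validation error type for bad text is not documented.
    if not isinstance(text, str) or not text.strip():
        print("skipped: text must be a non-empty string")
        return None
    try:
        return extract(text, **kwargs)
    except ValueError as exc:
        # Documented: unsupported engine or device.
        print(f"bad argument: {exc}")
    except RuntimeError as exc:
        # Documented: missing libraries or models, with an installation hint.
        print(f"missing dependency or model: {exc}")
    return None


# Intended use (requires textwizard):
# import textwizard as tw
# res = safe_extract(tw.extract_entities, "OpenAI is based in San Francisco.", engine="stanza")
```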

See also