=================
Text statistics
=================

``analyze_text_statistics`` computes a compact suite of lexical-diversity and frequency-distribution metrics from a string.

.. note::
   All numeric values are rounded to **3 decimal places**.

   If there are fewer than **2 distinct tokens**, ``zipf.slope`` and ``zipf.r2`` are ``NaN``.

What it computes
================

- ``entropy`` – Shannon entropy (bits per token) of the token-frequency distribution.
- ``zipf.slope`` / ``zipf.r2`` – slope and R² of the linear fit on ``log10(rank) → log10(freq)``.
- ``vocab_gini`` – Gini coefficient of type-frequency inequality (0=uniform, 1=max inequality).
- ``type_token_ratio`` – ``|V| / N``, unique types over total tokens.
- ``hapax_ratio`` – number of types that occur exactly once (hapax legomena) divided by total tokens ``N``.
- ``simpson_index`` – Simpson diversity index: ``1 - Σ (f_w / N)^2``.
- ``yule_k`` – Yule’s K (lexical concentration): ``10^4 · (Σ i^2·V_i − N) / N^2``.
- ``avg_word_length`` – average token length (characters), weighted by frequency.

Parameters
==========

- ``text`` (``str``): Input string to analyze.

Returns
=======

``dict``
    A dictionary with the following keys (all floats unless noted):

    - ``entropy``
    - ``zipf`` (``dict``) → ``{"slope": float, "r2": float}``
    - ``vocab_gini``
    - ``type_token_ratio``
    - ``hapax_ratio``
    - ``simpson_index``
    - ``yule_k``
    - ``avg_word_length``
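
The return shape above can be written down as type hints. This is only a sketch for annotating calling code: the names ``ZipfFit`` and ``TextStatistics`` are hypothetical and are not assumed to be exported by ``textwizard``.

.. code-block:: python

   from typing import TypedDict


   class ZipfFit(TypedDict):
       # Nested result of the log-log regression; both values may be NaN.
       slope: float
       r2: float


   class TextStatistics(TypedDict):
       # Hypothetical mirror of the documented keys, not a textwizard type.
       entropy: float
       zipf: ZipfFit
       vocab_gini: float
       type_token_ratio: float
       hapax_ratio: float
       simpson_index: float
       yule_k: float
       avg_word_length: float

A static type checker can then flag misspelled keys such as ``stats["gini"]`` in code that consumes the result.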

Metric definitions
==================

- **Shannon entropy**:
  ``H = − Σ_w (f_w / N) · log₂(f_w / N)``

- **Zipf fit**:
  Rank types by descending frequency (rank 1 = most frequent), then fit a least-squares line to ``x = log10(rank)``, ``y = log10(freq)`` → report ``slope`` and ``R²``.

- **Gini**:
  With frequencies ``v₁ ≤ v₂ ≤ … ≤ v_m``, total tokens ``N``, types ``m``:
  ``G = (2 · Σ_{i=1..m} i·v_i) / (m·N) − (m + 1) / m``

- **Type–Token Ratio**:
  ``TTR = |V| / N``

- **Hapax Ratio**:
  ``(#types with frequency 1) / N``

- **Simpson index**:
  ``1 − Σ_w (f_w / N)²``

- **Yule’s K**:
  ``K = 10⁴ · (Σ_{i=1..M} i²·V_i − N) / N²``, where ``V_i`` is the number of types with frequency ``i`` and ``M`` is the highest frequency observed.

- **Average word length**:
  ``(Σ_w |w| · f_w) / N``
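
The definitions above, apart from the Zipf fit, can be sketched in plain Python. This is an illustrative re-implementation, not the library's code: the function name ``sketch_statistics`` is hypothetical, the tokenization is the one described in the Notes, and rejecting empty input is an assumption of the sketch.

.. code-block:: python

   import math
   from collections import Counter


   def sketch_statistics(text):
       """Hypothetical re-implementation of the formulas above."""
       tokens = text.lower().split()      # tokenization described in the Notes
       freqs = Counter(tokens)            # f_w for every type w
       n = len(tokens)                    # N, total tokens
       m = len(freqs)                     # |V|, number of types
       if n == 0:
           # Assumption: empty input is outside the scope of this sketch.
           raise ValueError("empty input is not covered by this sketch")

       probs = [f / n for f in freqs.values()]
       entropy = -sum(p * math.log2(p) for p in probs)
       simpson = 1 - sum(p * p for p in probs)

       # Gini over ascending frequencies v_1 <= ... <= v_m.
       ascending = sorted(freqs.values())
       weighted = sum(i * v for i, v in enumerate(ascending, start=1))
       gini = 2 * weighted / (m * n) - (m + 1) / m

       # Frequency spectrum: V_i = number of types with frequency i.
       spectrum = Counter(freqs.values())
       sum_sq = sum(i * i * v_i for i, v_i in spectrum.items())
       yule_k = 1e4 * (sum_sq - n) / (n * n)

       avg_len = sum(len(w) * f for w, f in freqs.items()) / n
       return {
           "entropy": round(entropy, 3),
           "vocab_gini": round(gini, 3),
           "type_token_ratio": round(m / n, 3),
           "hapax_ratio": round(spectrum[1] / n, 3),
           "simpson_index": round(simpson, 3),
           "yule_k": round(yule_k, 3),
           "avg_word_length": round(avg_len, 3),
       }

For the input ``"a a a b b c d e f g"`` the frequencies are 3, 2, 1, 1, 1, 1, 1 with ``N = 10``, so for example ``Σ i²·V_i = 1·5 + 4·1 + 9·1 = 18`` and ``K = 10⁴ · (18 − 10) / 100 = 800``.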

Examples
========

Basic usage
-----------

.. code-block:: python

   import textwizard as tw

   stats = tw.analyze_text_statistics("a a a b b c d e f g")
   print(stats)

**Output**

.. code-block:: text

   {'entropy': 2.646, 'zipf': {'slope': -0.605, 'r2': 0.838}, 'vocab_gini': 0.229, 'type_token_ratio': 0.7, 'hapax_ratio': 0.5, 'simpson_index': 0.82, 'yule_k': 800.0, 'avg_word_length': 1.0}


Single repeated token
---------------------

.. code-block:: python

   import textwizard as tw
   print(tw.analyze_text_statistics("hello hello hello"))

**Output**

.. code-block:: text

   {'entropy': -0.0, 'zipf': {'slope': nan, 'r2': nan}, 'vocab_gini': 0.0, 'type_token_ratio': 0.333, 'hapax_ratio': 0.0, 'simpson_index': 0.0, 'yule_k': 6666.667, 'avg_word_length': 5.0}
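
Because the Zipf values can be ``NaN``, downstream code may want to normalize the result before comparing or serializing it. The sketch below works on a copy of the output shown above; ``flatten_stats`` is a hypothetical helper, not part of ``textwizard``.

.. code-block:: python

   import math

   # Output copied from the single-token example.
   stats = {
       "entropy": -0.0,
       "zipf": {"slope": math.nan, "r2": math.nan},
       "vocab_gini": 0.0,
       "type_token_ratio": 0.333,
       "hapax_ratio": 0.0,
       "simpson_index": 0.0,
       "yule_k": 6666.667,
       "avg_word_length": 5.0,
   }


   def flatten_stats(stats):
       """Hypothetical helper: flatten nested dicts and map NaN to None."""
       flat = {}
       for key, value in stats.items():
           if isinstance(value, dict):
               # Nested "zipf" dict becomes "zipf.slope" and "zipf.r2".
               for sub_key, sub_value in value.items():
                   flat[f"{key}.{sub_key}"] = sub_value
           else:
               flat[key] = value
       return {k: (None if math.isnan(v) else v) for k, v in flat.items()}


   flat = flatten_stats(stats)
   print(flat["zipf.slope"], flat["yule_k"])

``NaN`` compares unequal to itself, so a test such as ``stats["zipf"]["slope"] == float("nan")`` is always ``False``; ``math.isnan`` is the reliable check.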

Notes
=====

- Tokenization is deliberately simple (``text.lower().split()``) to keep metrics stable and fast: tokens are lowercased and split on whitespace only, so punctuation stays attached to words.
- ``NaN`` appears for Zipf metrics when fewer than two distinct tokens are present.
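
The two notes above can be illustrated together with a minimal sketch of the Zipf fit. The function name ``sketch_zipf_fit`` is hypothetical, and returning ``NaN`` for ``r2`` when every type has the same frequency is an assumption of this sketch, not documented library behavior.

.. code-block:: python

   import math
   from collections import Counter


   def sketch_zipf_fit(text):
       """Hypothetical least-squares fit of log10(freq) on log10(rank)."""
       counts = Counter(text.lower().split())
       freqs = sorted(counts.values(), reverse=True)   # rank 1 = most frequent
       if len(freqs) < 2:
           # Fewer than two distinct tokens: no line can be fitted.
           return {"slope": math.nan, "r2": math.nan}

       xs = [math.log10(rank) for rank in range(1, len(freqs) + 1)]
       ys = [math.log10(f) for f in freqs]
       mean_x = sum(xs) / len(xs)
       mean_y = sum(ys) / len(ys)
       sxx = sum((x - mean_x) ** 2 for x in xs)
       syy = sum((y - mean_y) ** 2 for y in ys)
       sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

       slope = sxy / sxx
       # Assumption: with identical frequencies the fit is flat and R² is undefined.
       r2 = (sxy * sxy) / (sxx * syy) if syy > 0 else math.nan
       return {"slope": round(slope, 3), "r2": round(r2, 3)}


   print(sketch_zipf_fit("a a a b b c d e f g"))
   print(sketch_zipf_fit("hello hello hello"))

With a single distinct token there is only one ``(rank, freq)`` point, which is why no slope can be reported in that case.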