Text statistics

Compute a compact suite of lexical diversity and distribution metrics from a string.

Note

All numeric values are rounded to 3 decimals. If there are fewer than 2 distinct tokens, zipf.slope and zipf.r2 are NaN.
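Because NaN never compares equal to anything, callers need `math.isnan` rather than `==` to detect the undefined Zipf values. The following is a minimal sketch of that check. It uses a literal copy of the documented single-token output instead of calling the library, and `zipf_or_none` is a hypothetical helper name, not part of textwizard.

```python
import math

# Literal copy of the documented output for "hello hello hello".
stats = {
    "entropy": -0.0,
    "zipf": {"slope": float("nan"), "r2": float("nan")},
    "vocab_gini": 0.0,
    "type_token_ratio": 0.333,
    "hapax_ratio": 0.0,
    "simpson_index": 0.0,
    "yule_k": 6666.667,
    "avg_word_length": 5.0,
}


def zipf_or_none(result: dict) -> dict:
    """Hypothetical helper: replace NaN Zipf values with None."""
    return {
        key: (None if math.isnan(value) else value)
        for key, value in result["zipf"].items()
    }


# NaN is not equal to anything, including itself,
# so an equality test cannot detect it.
print(stats["zipf"]["slope"] == float("nan"))  # False
print(math.isnan(stats["zipf"]["slope"]))      # True
print(zipf_or_none(stats))                     # {'slope': None, 'r2': None}
```

Mapping NaN to None is one possible convention; it is convenient when the result is later serialized, since NaN is not valid in strict JSON.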

What it computes

  • entropy – Shannon entropy (bits per token) of the token-frequency distribution.

  • zipf.slope / zipf.r2 – slope and R² of the linear fit of log10(freq) against log10(rank).

  • vocab_gini – Gini coefficient of type-frequency inequality (0=uniform, 1=max inequality).

  • type_token_ratio – |V| / N, unique types over total tokens.

  • hapax_ratio – share of tokens that occur exactly once.

  • simpson_index – Simpson diversity index: 1 - Σ (f_w / N)^2.

  • yule_k – Yule’s K (lexical concentration): 10^4 · (Σ i^2·V_i − N) / N^2.

  • avg_word_length – average token length (characters), weighted by frequency.

Parameters

  • text (str): Input string to analyze.

Returns

dict

A dictionary with the following keys (all floats unless noted):

  • entropy

  • zipf (dict) → {"slope": float, "r2": float}

  • vocab_gini

  • type_token_ratio

  • hapax_ratio

  • simpson_index

  • yule_k

  • avg_word_length

Metric definitions

  • Shannon entropy: H = −Σ_w (f_w / N) · log₂(f_w / N)

  • Zipf fit: Linear regression of x = log10(rank), y = log10(freq) → report slope and R².

  • Gini: With frequencies sorted in ascending order, v₁ ≤ v₂ ≤ … ≤ v_m, total tokens N, types m: G = (2 · Σ_{i=1..m} i·v_i) / (m·N) − (m + 1) / m

  • Type–Token Ratio: TTR = |V| / N

  • Hapax Ratio: (#types with frequency 1) / N

  • Simpson index: 1 − Σ_w (f_w / N)²

  • Yule’s K: K = 10⁴ · (Σ_{i=1..M} i²·V_i − N) / N², where V_i = #types with frequency i

  • Average word length: (Σ_w |w| · f_w) / N
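The count-based definitions above can be checked by hand with a short standard-library sketch. This is an illustrative reconstruction, not textwizard's actual implementation: the function name `sketch_statistics` is hypothetical, the tokenization follows the rule stated in the Notes, and raising on empty input is an assumption of this sketch. The Zipf fit is left out here.

```python
import math
from collections import Counter


def sketch_statistics(text: str) -> dict:
    """Illustrative re-derivation of the count-based metrics."""
    tokens = text.lower().split()      # tokenization as described in the Notes
    n = len(tokens)                    # N: total tokens
    if n == 0:
        # Assumption: empty input is simply rejected in this sketch.
        raise ValueError("empty input is not handled in this sketch")
    freq = Counter(tokens)             # f_w for every type w
    m = len(freq)                      # |V|: number of types

    probs = [f / n for f in freq.values()]
    entropy = -sum(p * math.log2(p) for p in probs)
    simpson = 1 - sum(p * p for p in probs)

    # Gini over frequencies sorted in ascending order (v_1 <= ... <= v_m).
    ascending = sorted(freq.values())
    weighted = sum(i * v for i, v in enumerate(ascending, start=1))
    gini = 2 * weighted / (m * n) - (m + 1) / m

    # Frequency spectrum: V_i = number of types that occur exactly i times.
    spectrum = Counter(freq.values())
    sum_i2_vi = sum(i * i * v_i for i, v_i in spectrum.items())
    yule_k = 1e4 * (sum_i2_vi - n) / (n * n)

    avg_len = sum(len(w) * f for w, f in freq.items()) / n

    return {
        "entropy": round(entropy, 3),
        "vocab_gini": round(gini, 3),
        "type_token_ratio": round(m / n, 3),
        "hapax_ratio": round(spectrum[1] / n, 3),
        "simpson_index": round(simpson, 3),
        "yule_k": round(yule_k, 3),
        "avg_word_length": round(avg_len, 3),
    }


print(sketch_statistics("a a a b b c d e f g"))
```

For "a a a b b c d e f g" the frequencies are 3, 2, 1, 1, 1, 1, 1 with N = 10, so Σ i²·V_i = 1·5 + 4·1 + 9·1 = 18 and K = 10⁴ · (18 − 10) / 100 = 800, which matches the documented output.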

Examples

Basic usage

import textwizard as tw

stats = tw.analyze_text_statistics("a a a b b c d e f g")
print(stats)

Output

{'entropy': 2.646, 'zipf': {'slope': -0.605, 'r2': 0.838}, 'vocab_gini': 0.229, 'type_token_ratio': 0.7, 'hapax_ratio': 0.5, 'simpson_index': 0.82, 'yule_k': 800.0, 'avg_word_length': 1.0}

Single repeated token

import textwizard as tw
print(tw.analyze_text_statistics("hello hello hello"))

Output

{'entropy': -0.0, 'zipf': {'slope': nan, 'r2': nan}, 'vocab_gini': 0.0, 'type_token_ratio': 0.333, 'hapax_ratio': 0.0, 'simpson_index': 0.0, 'yule_k': 6666.667, 'avg_word_length': 5.0}
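The zipf values in the two outputs above can be reproduced with an ordinary least-squares fit on the log-log rank/frequency pairs. The sketch below is a reconstruction under stated assumptions, not the library's code: `sketch_zipf_fit` is a hypothetical name, ranks are assigned by descending frequency, and returning NaN for R² when all types share one frequency is a choice made here, since that case is not documented.

```python
import math
from collections import Counter


def sketch_zipf_fit(text: str) -> dict:
    """Least-squares fit of log10(freq) against log10(rank)."""
    tokens = text.lower().split()
    # Rank 1 is the most frequent type; ties do not affect the fit.
    freqs = sorted(Counter(tokens).values(), reverse=True)
    if len(freqs) < 2:
        # A line cannot be fitted through fewer than two points.
        return {"slope": math.nan, "r2": math.nan}

    xs = [math.log10(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log10(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    if len(set(freqs)) == 1:
        # Assumption of this sketch: with no variance in y, R² is undefined.
        r2 = math.nan
    else:
        r2 = (sxy * sxy) / (sxx * syy)
    return {"slope": round(slope, 3), "r2": round(r2, 3)}


print(sketch_zipf_fit("a a a b b c d e f g"))  # {'slope': -0.605, 'r2': 0.838}
print(sketch_zipf_fit("hello hello hello"))    # {'slope': nan, 'r2': nan}
```

With a single type there is only one (rank, freq) point, so neither a slope nor an R² can be computed, which is why both fields come back as nan.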

Notes

  • Tokenization is deliberately simple (text.lower().split()) to keep metrics stable and fast.

  • NaN appears for Zipf metrics when fewer than two distinct tokens are present.
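A practical consequence of the tokenization rule in the notes above is that case is folded but punctuation stays attached to the token, so "cat." and "cat" count as different types. A minimal sketch using only the stated expression `text.lower().split()` (the wrapper name `sketch_tokenize` is hypothetical):

```python
from collections import Counter


def sketch_tokenize(text: str) -> list:
    """Lowercase, then split on any run of whitespace."""
    return text.lower().split()


tokens = sketch_tokenize("The cat. the CAT,\nthe cat")
print(tokens)  # ['the', 'cat.', 'the', 'cat,', 'the', 'cat']

counts = Counter(tokens)
print(counts["the"])  # 3
print(len(counts))    # 4 types: 'the', 'cat.', 'cat,', 'cat'
```

If punctuation-insensitive counts are needed, stripping punctuation from the input before calling the function is one option; whether that suits a given corpus is a judgment call for the caller.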